We present the Nencki Genomics Database, which extends the functionality of Ensembl Regulatory Build (funcgen) for the 
three species: human, mouse and rat. The key enhancements over Ensembl funcgen include the following: (i) a user can add 
private data, analyze them alongside the public data and manage access rights; (ii) inside the database, we provide efficient 
procedures for computing intersections between regulatory features and for mapping them to the genes. To Ensembl 
funcgen-derived data, which include data from ENCODE, we add information on conserved non-coding (putative regulatory) 
sequences, and on genome-wide occurrence of transcription factor binding site motifs from the current versions of 
two major motif libraries, namely, Jaspar and Transfac. The intersections and mapping to the genes are pre-computed for 
the public data, and the result of any procedure run on the data added by the users is stored back into the database, thus 
incrementally increasing the body of pre-computed data. As the Ensembl funcgen schema for the rat is currently not 
populated, our database is the first database of regulatory features for this frequently used laboratory animal. The database 
is accessible without registration using the mysql client: mysql -h database.nencki-genomics.org -u public. 
Registration is required only to add or access private data. A WSDL webservice provides access to the database from any 
SOAP client, including the Taverna Workbench with a graphical user interface. 

Database URL: http://www.nencki-genomics.org. 



Introduction 

Analysis of gene co-regulation requires programmatic 
access to large amounts of regulatory genomics data, 
such as the coordinates of genes, chromatin modifications, 
transcription factor (TF) binding sites and/or motifs. It is 
often preferred to access such data in a relational database, 
such as the EBI Ensembl database (1). Importantly, this database
provides data generated by several other projects, including
ENCODE (2), VISTA Enhancer Browser (3) and cisRED (4).
However, in Ensembl, the relevant data are spread among
several relatively complex schemas, namely, funcgen, compara
and core. Moreover, in the Ensembl database, it is not
possible for users to upload, manage and share their own
private data or to compute genome-wide intersections
between genomic features (overlaps on the genomic sequence).
Such intersections can easily be computed with
external programs, such as BEDTools (5, 6) or ChIPseeqer
(7), but analysis of the result in a relational database
requires exporting the data from the Ensembl database,
running the computation and importing the result into
another database.

We developed a database system, named the Nencki
Genomics Database (NGD), which for the three species
currently represented in Ensembl funcgen (i.e. human, mouse
and rat) extends the data and functionality of Ensembl
funcgen. To the Ensembl-derived data, we add information
on conserved non-coding (putative regulatory) sequences,
and on genome-wide occurrences (instances) of transcription
factor binding site (TFBS) motifs, from the current
versions of two major motif libraries: public Jaspar (8, 9)
and (for the most recent NGD version 71_1) also commercial
Transfac Professional (Biobase).

NGD contains public data, either derived from Ensembl or provided
by us, and data submitted by users, which their owners can make
public. For efficiency reasons, in the database, we
separate the instances of TFBS motifs from the remaining
regulatory features, which we term areas. In addition to
SQL queries, NGD provides procedures for (i) genomic data
analysis (area-gene mapping, area-area intersections and
area-motif intersections) and (ii) data management
(addition/removal, managing access rights and making the
data public) (Figure 1). The results of the intersections and
of the area-gene mapping are pre-computed for the public data.
The amount of added intersection data is significant, with
over 6 billion area-motif intersections (Table 1). NGD is
accessible from a mysql client, and
its schema is optimized for regulation-related queries.

A SOAP/WSDL webservice layer provides an API to the database
procedures, and an additional functionality of graphical
visualization of selected NGD content. The webservice
can be accessed from a GUI-based client, such as Taverna
Workbench. The public areas added by us (excluding the
data from Ensembl) and by other users can also be visualized
as external DAS tracks in Ensembl Genome Browser.
We are currently working to add to NGD a web browser-based
interface, as part of a portal integrating genomic and
expression data with analysis tools.

NGD architecture 

Schemas 

NGD is versioned, corresponding to the underlying version of
Ensembl. Owing to efficiency and access control issues, NGD
contains four schemas for each NGD version and species. For
example, NGD version 71_1 (Ensembl v.71, local NGD version
1) for the human has four schemas (71_1_hsap_base_public,
71_1_hsap_base, 71_1_hsap_public, 71_1_hsap_users), of
which two are the 'base' schemas (71_1_hsap_base_public,
71_1_hsap_base) containing tables with the actual data
(public and submitted by the users, respectively). The 'base'
schemas are not directly accessible/visible to the user, who has
access to two schemas: 'public' and 'users' (in this example:
71_1_hsap_public and 71_1_hsap_users) containing views to
the corresponding tables of the appropriate 'base' schema.



Access control 

The access control, built into the 'users' schema, operates 
on a per user and data set basis. For each user and data set, 
there are two possible access levels: 'owner', who can do 
anything with the data and manage access of other users; 
and 'reader', who can see but cannot modify the data or 
the access rights. 
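
The two access levels can be illustrated with a small in-memory model. This is only a sketch of the rules stated above, not the NGD implementation: the class and method names are hypothetical, and the rule that the user who adds a data set becomes its owner is an assumption.

```python
from enum import Enum
from typing import Dict, Tuple


class Access(Enum):
    READER = 1
    OWNER = 2


class AccessTable:
    """Per-user, per-data-set access levels (hypothetical in-memory model)."""

    def __init__(self) -> None:
        self._levels: Dict[Tuple[str, int], Access] = {}

    def create(self, user: str, dataset_id: int) -> None:
        # Assumption: whoever adds a data set becomes its owner.
        self._levels[(user, dataset_id)] = Access.OWNER

    def grant(self, granter: str, user: str, dataset_id: int, level: Access) -> None:
        # Only an owner may manage the access of other users.
        if self._levels.get((granter, dataset_id)) is not Access.OWNER:
            raise PermissionError(f"{granter} does not own data set {dataset_id}")
        self._levels[(user, dataset_id)] = level

    def can_read(self, user: str, dataset_id: int) -> bool:
        # Both readers and owners can see the data.
        return (user, dataset_id) in self._levels

    def can_modify(self, user: str, dataset_id: int) -> bool:
        # Only owners can modify the data or the access rights.
        return self._levels.get((user, dataset_id)) is Access.OWNER
```

In this model, a reader passes `can_read` but fails `can_modify`, and an attempt by a reader to call `grant` raises `PermissionError`.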

Tables/views 

Similarly to Ensembl, the schema for each NGD version and
species contains identical tables/views (Figure 2). The views
in the 'public' and 'users' schemas are named the same, but
provide access to different sets of data: respectively, all the
public data and all the user-supplied data to which a given
user has access. The key to the database is the table dataset,
which describes each data set in the database in terms of its class
and type, following Ensembl funcgen. The data sets derived
from Ensembl funcgen can be mapped back to it using the
funcgen feature_set_id. The actual genomic positions of
areas and motifs (i.e. motif instances) are contained in the
tables area and motif (for Jaspar and user-added motifs) or
motif_transfac (for Transfac). The results of mapping areas
to genes are stored in the table area_gene_map. The area-area
intersections reside in the table area_intersection,
whereas the area-motif intersections reside in the tables
motif_intersection (Jaspar and user-added motifs) and
motif_intersection_transfac (for Transfac). The motif and
motif_intersection data are separated between Jaspar and
Transfac for efficiency reasons. The mappings of Jaspar
TFBS motifs to the transcription factors, as well as other
information about data sets, such as a data set name and
a cell type of origin for experimental data, are stored as
name:value pairs in the table dataset_attr, linked to the
table dataset by dataset_id. The Transfac license prohibits
us from making the TF-motif mappings public; instead, the
table dataset_attr contains links to the motifs' entries on
the Transfac web site.

Additionally, the 'users' schema contains the so-called
input tables, named with the ending '_in_table', used by
the procedures. The users are allowed to execute select,
insert and delete queries on the input tables; for the rest
of the tables, only select queries and access through the
procedures are permitted. Temporary tables can be created
in the tmp_tables schema. The temporary tables are
session-separated and therefore private to each user.

Procedures 

The 'public' schema contains stored procedures for data ana- 
lysis, which permit genome-wide area-gene mapping, area- 
area intersections and area-motif intersections. The 'users' 
schema, in addition to the procedures for data analysis, con- 
tains procedures for data and access rights management. All 
the procedures operate at a whole data set level. 

Figure 1. A diagram presenting overview of content and functionality of NGD.

Table 1. Pre-computed data in NGD version 71_1

Database statistics                              Human         Mouse        Rat
Number of area data sets                         551           69           3
Number of area instances                         11994513      2390956      1313917
Number of area intersections                     276694188     6277878      57406
Number of motif instances (Jaspar)               324078012     269409542    207574820
Number of area-motif intersections (Jaspar)      1314579931    237809904    77528929
Number of motif instances (Transfac)             1229116246    989775972    763762275
Number of area-motif intersections (Transfac)    5255764878    921893795    301210194

The numbers of motif data sets were the same for every species: Jaspar core vertebrate: 146, Jaspar PBM: 208, Transfac: 1476.



Figure 2. The main tables/views of the NGD 'public' and 'users' schemas. The table motif is nearly identical to the table area,
with the only difference that its main index column is named motif_id, not area_id. The table area_intersection has the same
structure as the table motif_intersection, but with the column names area1_id and area2_id instead of area_id and motif_id.

The area-gene mapping procedure maps to each gene the areas (from
the given list of data sets) that intersect (overlap or are
contained in) the ±10 kb flank of the transcription start
site of that gene. Different area-area intersection
procedures compute intersections between
(features from) given pairs of data sets, between all possible
pairs of data sets and intersections of one data set
with all the other data sets (to which a given user has
access). Different area-motif intersection procedures compute
intersections between given pairs of area and motif
data sets, intersection of a given area data set with all motif
data sets, intersection of a given motif data set with all area
data sets. The choice of the right intersection procedure is
facilitated by the use of webservice (see later in the text).
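
The area-gene mapping rule can be written down in a few lines. The sketch below is a brute-force illustration of the ±10 kb criterion only; it is not the stored procedure itself. The function name, the dictionary layout and the use of closed coordinates are assumptions made for the example.

```python
from typing import Dict, List, Tuple

FLANK = 10_000  # the ±10 kb flank around the transcription start site (TSS)


def map_areas_to_genes(
    areas: Dict[int, Tuple[str, int, int]],   # area_id -> (chromosome, start, end)
    tss: Dict[str, Tuple[str, int]],          # gene_id -> (chromosome, TSS position)
) -> List[Tuple[int, str]]:
    """Return (area_id, gene_id) pairs for areas touching a gene's TSS flank."""
    mapping: List[Tuple[int, str]] = []
    for gene_id, (gene_chrom, position) in tss.items():
        flank_start, flank_end = position - FLANK, position + FLANK
        for area_id, (area_chrom, start, end) in areas.items():
            # An area is mapped if it overlaps, or is contained in, the flank.
            if area_chrom == gene_chrom and start <= flank_end and end >= flank_start:
                mapping.append((area_id, gene_id))
    return mapping
```

The nested loop is quadratic and is meant only to make the criterion explicit; it would not scale to the data volumes listed in Table 1.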

The procedures for data and access management permit
addition/deletion of a data set and (optionally) data set attributes,
granting/revoking access and making a data set public.
There are mechanisms for maintaining the database integrity;
for example, deletion of a data set results in removal of all its
intersections. The operation of making a data set public is
irrevocable, and a public data set cannot be deleted.

Algorithms 

The procedures for mapping and intersections use variants
of the efficient sweep line algorithm (10). The core algorithm,
common to all these procedures, operates as follows:

(1) All intervals are placed in the set W, sorted
by the left end;

(2) The sweep (S) is initially empty;

(3) The sweep always contains the intervals that
start before or exactly at the
x-coordinate of the line (representing the
sweep) and that end after that
x-coordinate.

Algorithm iterative step scenario:

(1) Take the next interval (K) from W;

(2) Delete from S all the intervals that end
before K starts;

(3) Add K into S and return the pair-wise intersection
of K with each of the other intervals in S.
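
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the code of the stored procedures: it assumes all intervals lie on one chromosome, uses closed coordinates (an interval ending exactly where another starts counts as intersecting), and the names are chosen for the example.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int, str]  # (start, end, label), closed coordinates


def sweep_intersections(intervals: Iterable[Interval]) -> List[Tuple[str, str]]:
    """Return the labels of every pair of intersecting intervals."""
    pending = sorted(intervals, key=lambda iv: iv[0])  # the set W, sorted by left end
    sweep: List[Interval] = []                         # the sweep S, initially empty
    pairs: List[Tuple[str, str]] = []
    for current in pending:                            # take the next interval K from W
        start = current[0]
        # Delete from S all the intervals that end before K starts.
        sweep = [iv for iv in sweep if iv[1] >= start]
        # Every interval still in S started no later than K and has not
        # ended yet, so it intersects K.
        pairs.extend((iv[2], current[2]) for iv in sweep)
        sweep.append(current)                          # add K into S
    return pairs
```

Rebuilding the sweep list at each step keeps the sketch short; an implementation tuned for speed would more likely keep S ordered by right end so that expired intervals can be removed without scanning the whole set.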

We tested all our data analysis procedures on a number of
data sets by comparing their results with those of
identical or equivalent (in the case of area-gene mapping)
analyses performed with BEDTools (6), treating the BEDTools
results as the gold standard.

Webservice/clients 

A WSDL webservice available at the URL
http://webservices.nencki-genomics.org/genomic?wsdl provides access to the
functionality of the stored NGD procedures from any
SOAP client. The webservice functions are augmented in
several respects compared to the stored procedures. For
example, the webservice function automatically chooses
the correct intersection procedure based on the input
data. The webservice operations that take a long time
support long-running jobs: they return an identifier
(id) of the submitted job and send email notifications of
job submission and completion. We prepared python clients,
taking arguments and options on the command line,
which facilitate the use of the webservice. There are separate
clients for loading, management, analysis and plotting
of the data. We also provide a GUI-based interface to the
plotting function in the form of a Taverna workflow. These
clients can be downloaded from the Webservice section of
the online documentation
(http://www.nencki-genomics.org/wiki/doku.php?id=tutorial:webservices).

Example NGD use

The results of the intersection and mapping procedures are 
pre-computed for the public data (Table 1). Therefore, many 
questions of biological interest can be answered using SQL 
queries. The versatility of NGD stems from the possibility of 
performing arbitrary SQL queries on the intersection and 
mapping data. For example, to identify all the instances of 
the Jaspar motifs in the rat AVID-VISTA regions and to map 
them to the genes, the user could execute the following 
commands at the mysql command prompt: 

USE 71_1_rnor_public;

-- Look up the id of the AVID-VISTA data set
SET @ds_id = (SELECT dataset_id FROM dataset
WHERE class = 'AVID-VISTA');

-- Jaspar motif instances intersecting the AVID-VISTA areas
CREATE TEMPORARY TABLE tmp_tables.av_jaspar
SELECT area_id, motif_id, motif_dataset_id
FROM motif_intersection
WHERE area_dataset_id = @ds_id;

ALTER TABLE tmp_tables.av_jaspar ADD INDEX (area_id);

-- Map the intersecting areas (and their motifs) to the genes
CREATE TEMPORARY TABLE tmp_tables.gene_av_jaspar
SELECT b.gene_id, a.*
FROM tmp_tables.av_jaspar AS a
JOIN area_gene_map AS b USING (area_id);
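
To see what the two temporary tables above contain, the same logic can be replayed on a toy data set with Python's built-in sqlite3 module. This is only a sketch: the rows are invented, only the columns used by the query are created, the index step is omitted, and SQLite's CREATE TEMP TABLE ... AS SELECT stands in for the MySQL syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Toy stand-ins for the NGD views; all rows are made up for illustration.
cur.executescript("""
CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, class TEXT);
CREATE TABLE motif_intersection (area_id INTEGER, motif_id INTEGER,
                                 area_dataset_id INTEGER, motif_dataset_id INTEGER);
CREATE TABLE area_gene_map (area_id INTEGER, gene_id TEXT);
INSERT INTO dataset VALUES (1, 'AVID-VISTA'), (2, 'Histone');
INSERT INTO motif_intersection VALUES (10, 500, 1, 7), (11, 501, 1, 7), (20, 502, 2, 7);
INSERT INTO area_gene_map VALUES (10, 'geneA'), (20, 'geneB');
""")

# Step 1: motif instances that intersect areas of the AVID-VISTA data set.
cur.execute("""
    CREATE TEMP TABLE av_jaspar AS
    SELECT area_id, motif_id, motif_dataset_id
    FROM motif_intersection
    WHERE area_dataset_id = (SELECT dataset_id FROM dataset
                             WHERE class = 'AVID-VISTA')
""")

# Step 2: keep only the areas that are mapped to a gene.
rows = cur.execute("""
    SELECT b.gene_id, a.area_id, a.motif_id, a.motif_dataset_id
    FROM av_jaspar AS a
    JOIN area_gene_map AS b ON a.area_id = b.area_id
    ORDER BY a.area_id
""").fetchall()
print(rows)
```

In the toy data, area 11 intersects a motif but is not mapped to any gene, and area 20 belongs to a different data set, so only area 10 survives both steps.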

For all operations on the public data, a user does not
need the database procedures, as their results on all these
data have been pre-computed and are stored in the database.
The procedures become necessary when the user
wants to analyze their own data in the context of the public
data or data shared by another user.

Upload of user-supplied area data is possible using the
database procedure area_reload_proc or the webservice
function calling this procedure, and similarly for motifs and
data set attributes. We recommend upload via the webservice,
which is both easier (no need to directly fill the input tables)
and more convenient (support for long jobs with email
notification). The use of the webservice is facilitated by the
provided python clients accepting command line arguments and
options. Notably, called with the -h or -help option, each
client returns a full description of all its arguments and
options. The uploaded data must be in one of two formats: the
BED format, described at the UCSC website
(http://genome.ucsc.edu/FAQ/FAQformat) or the NGD format, described in
the documentation online
(http://www.nencki-genomics.org/wiki/doku.php?id=tutorial:webservices).
To upload an example area data set using the python client, the user
invokes it from the command line, providing their own database
credentials, database version, species, path to the data file and
the data set description in terms of its type and class:

client_load.py -u <user> -p <passwd> -v 71_1 \
-s rnor -f /path_to_data/area_to_load.tsv -t 'test area' -c 'test class'



Once the user-supplied data are in the database, they
can be analyzed and managed using the stored database
procedures. As an example of calling a stored procedure,
the user could execute the following commands to compute
the intersections between two pairs of area data
sets, with the result stored in the table area_intersection:

INSERT INTO area_intersection_dataset_map_in_table
(dataset1_id, dataset2_id) VALUES (1,2), (3,4);

CALL area_intersection_map_proc;

The use of the remaining procedures is similar. Basically, 
the user needs to fill in an appropriate input table(s) and 
then call the procedure. There are mechanisms preventing 
duplication if a user attempts to compute a result already in 
the database. All the database procedures can be accessed 
via the webservice, notably also using a GUI client, such as 
Taverna Workbench. 

Additionally, the webservice-only function PlotGenomic
plots a graphical representation of selected NGD content in
the ±10 kb flank of the transcription start site of a chosen
gene and also returns this content as tab-separated files.
More precisely, this function returns instances of selected
area types, and instances of selected motif types that intersect
any of the returned area instances. This function can be
conveniently accessed using the command line client, also
by the public (i.e. unregistered) user. For example, to plot all
area types and selected motif types intersecting these
areas, in the ±10 kb flank of the human BDNF, the user
may execute the following command at the command line:

client_plot_genomic.py -u public -s hsap -l jaspar -g BDNF -m CREB1,CTCF,REST \
-a Yy1,Nrsf,CTCF,DNase1

Importantly, all the functions of our webservice can also 
be called using the GUI interface of Taverna Workbench. 
Figure 3 shows a screen-shot of the same plot generated 
from Taverna Workbench. 

Methods 

Import from Ensembl 

The main source of the data in our database for the human
and mouse is Ensembl funcgen, from which we import all
the features in the classes: Enhancer, Histone, Insulator,
Transcription Factor, Open Chromatin, Polymerase, Search
Region. The names of funcgen feature_sets (containing the
ENCODE names) can be found in the table dataset_attr.
The processing of the ENCODE data is described by the
Ensembl team at
http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/hg19/uniformTfbs.html.
The 'Search Region' class contains the promoter
predictions imported into Ensembl funcgen from
the cisRED database (http://www.cisred.org/) (4). To these
data, we add information about genes and CpG islands
from Ensembl core, and about pairwise genome alignments
(BlastZNet/LastZNet) (11) from Ensembl compara. The
LastZNet features are additionally filtered: we retain only
those with a length between 100 and 5000 base pairs. For
each Ensembl build and species, we use only one coordinate
system, the most current chromosomal one.

[Screenshot of Taverna Workbench 2.4.0 showing the workflow results, with ports area.types, gene, library, motif.name,
separate.motifs, species.short, version, file.data and file.name.]

Figure 3. Visualization of selected NGD content in the ±10 kb flank of a gene from Taverna Workbench. The plotting function
returns the motifs of the chosen type that intersect any of the shown areas.

Database tool

Database, Vol. 2013, Article ID bat069, doi:10.1093/database/bat069

AVID-VISTA alignments 

The ±10 kb genomic flanks of transcription start sites of
orthologous genes (Ensembl: ortholog_one2one,
apparent_ortholog_one2one) are aligned with AVID (12). The
alignment is processed with the program VISTA (13) using the
default parameters (minimum length: 100 nt, minimum
identity: 75%, excluded exons). The chromosomal coordinates of
the resulting conserved non-coding regions are imported into
the database, filtering away possible duplications.

Genome-wide TFBS motif finding 

For finding TFBS motifs in the genomic sequences downloaded
from Ensembl, we use the command line version of
the program matrix-scan (quick) (14), using the pre-computed
first order (2 nt.) Markov-chain background models
for each species, based on both strands of the upstream
non-coding regions of the genome, e.g.
2nt_upstream-noorf_Homo_sapiens_EnsEMBL-ovlp-2str for human. The
motif scores are converted to P-values, based on the
distributions of scores computed for each motif using the
program matrix-distrib. A uniform P-value threshold of
P < 0.0001 is used to call an instance of a particular motif
at a given genomic position. We used motifs from the
current (12 October 2009) version of Jaspar, including Jaspar
core vertebrate (8) and Jaspar PBM (9). For the most recent
NGD version 71_1, we also used all the vertebrate motifs
from the current version (Spring 2013.1) of Transfac
Professional (Biobase).
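
The conversion of motif scores to P-values and the thresholding step can be illustrated as follows. This is a sketch under stated assumptions, not the matrix-scan or matrix-distrib code: the score distribution is assumed to be available as a discrete table of score probabilities under the background model, and the function names and example numbers are hypothetical.

```python
from typing import Dict, List, Tuple

P_THRESHOLD = 1e-4  # the uniform threshold P < 0.0001


def score_to_pvalue(distribution: Dict[float, float]) -> Dict[float, float]:
    """Map each score to the probability of a score at least as high.

    distribution: score -> probability of that score under the background model.
    """
    pvalues: Dict[float, float] = {}
    tail = 0.0
    for score in sorted(distribution, reverse=True):
        tail += distribution[score]   # accumulate the upper tail
        pvalues[score] = tail
    return pvalues


def call_instances(
    hits: List[Tuple[int, float]],    # (genomic position, motif score)
    pvalues: Dict[float, float],
    threshold: float = P_THRESHOLD,
) -> List[int]:
    """Return the positions whose score passes the P-value threshold."""
    return [position for position, score in hits if pvalues[score] < threshold]
```

For example, with a toy distribution in which the two highest scores have probabilities 0.00005 and 0.00004, both pass the threshold (upper-tail P-values of about 0.00005 and 0.00009), whereas any lower score with a sizeable probability does not.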

Discussion 

Alternatives to NGD include:

• reference genomic databases and genome browsers:
Ensembl (1), UCSC (15)

• regulatory genomics databases, including VISTA
Enhancer Browser (3), cisRED (4), EELWeb (16),
MAPPER2 (17), D-light on promoters (18)

• programs for analysis of the motif content of regulatory
regions, including Toucan 2 (19), cREMaG (20),
oPOSSUM-3 (21)

• programs for analysis of next-generation sequencing
data, including BEDTools (5, 6) and ChIPseeqer (7).

NGD presents several improvements over the existing solutions
listed earlier in the text. Precomputed instances of
Jaspar motifs in Ensembl funcgen are provided only within
the ChIP-seq regions and are limited to the motifs for the
TFs represented in the ChIP-seq data, whereas in NGD, all
Jaspar and Transfac motifs are provided in the whole
genome (Table 1). Similarly, the aforementioned regulatory
genomics databases contain TFBS motif instances only in
predicted regulatory regions, not in the whole genome.

The intersection functionality offered by UCSC is limited,
as the UCSC Table Browser intersection procedure does not
return the full information about the intersecting features
(as does NGD). Instead, it returns either (i) the full information
on the features from one table, but no information on
the features from the other table or (ii) nucleotide positions
of the regions in the intersection, but with no reference to
the original tables. Moreover, the UCSC intersection interface
is not designed for running extensive (e.g. all pairwise)
intersections. The aforementioned programs for analysis
of motif content of regulatory regions do not permit
intersecting areas.

BEDTools, available as a standalone program and a
python library, offers rich functionality for intersections,
comparable with that of NGD. ChIPseeqer provides
functionality for area intersection, gene mapping and finding
of known public TFBS motifs. The main difference between
our system and the last two programs is the fact that in
NGD, the result of any procedure is stored back in the database
and is available to all users with access rights to the
underlying data sets, thus incrementally increasing the
body of pre-computed data. In particular, the results for
data sets that are made public also become public.